188 research outputs found

    Safe Functional Inference for Uncharacterized Viral Proteins

    The explosive growth in the number of sequenced genomes has created a flood of protein sequences with unknown structure and function. A routine protocol for functional inference on an input query sequence is based on a database search for homologues. Searching a query against a non-redundant database using BLAST (or more advanced methods, e.g. PSI-BLAST) suffers from several drawbacks: (i) a local alignment often dominates the results; (ii) the reported statistical score (i.e. E-value) is often misleading; (iii) incorrect annotations may be falsely propagated. 
Several systematic methods are commonly used to assign functions to sequences on a genomic scale. In Pfam (1) and similar resources, statistical profiles (HMMs) are built from semi-manual multiple alignments of seed homologous sequences. The profiles are then used to scan genomic sequences for additional family members. The drawbacks of this scheme are: (i) only families with a predetermined seed are considered; (ii) the query must have detectable sequence similarity to the seed sequences; (iii) internal relationships among family members, and relations to other families, are not addressed; (iv) family membership is often set by predetermined thresholds.
An alternative to profile- or model-based methods for functional inference relies on hierarchical clustering of the protein space, as implemented in the ProtoNet approach (2). The fundamental principle is the creation of a tree that captures evolutionary relatedness among protein families. The tree construction is fully automatic and is based only on reported BLAST similarities among clustered sequences. The tree provides protein groupings at a continuum of evolutionary granularities, from closely related families to distant superfamilies. Clusters in the ProtoNet tree show high correspondence with sequence-based (e.g. Pfam and InterPro), functional (e.g. E.C. classification) and structural (e.g. SCOP) families (3). A new clustering scheme (4) has provided an extensive update to the ProtoNet process, which is now based on direct clustering of all detectable sequence similarities.
Herein, we use the ProtoNet resource to develop a methodology for consistent and safe functional inference for remote families. We illustrate the success of our approach on clusters of poorly characterized viral proteins. Viral sequences are characterized by a rapid evolutionary rate, which makes viral families even more remote in terms of sequence similarity. Thus, functional inference for viral families is apparently an unsolved task. Despite this inherent difficulty, the new ProtoNet tree scaffold reliably captures weak evolutionary connections among viral families that were previously overlooked. We take advantage of this and propose new functional assignments for viral protein families.

    EVEREST: a collection of evolutionary conserved protein domains

    Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20 029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (over 3 000 000 sequences in total) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at . The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows browsing through the families of each of those sources and graphically visualizing the domain organization of the proteins in a family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families matching criteria regarding novelty, phylogenetic composition and more.

    Codon usage is associated with the evolutionary age of genes in metazoan genomes

    BACKGROUND: Codon usage may vary significantly between different organisms and between genes within the same organism. Several evolutionary processes have been postulated to be the predominant determinants of codon usage: selection, mutation, and genetic drift. However, the relative contribution of each of these factors in different species remains debatable. The availability of complete genomes for dozens of multicellular organisms provides an opportunity to inspect the relationship between codon usage and the evolutionary age of genes. RESULTS: We assign an evolutionary age to a gene based on the relative positions of its identified homologues in a standard phylogenetic tree. This yields a classification of all genes in a genome into several evolutionary age classes. The present study starts from the observation that each age class of genes has a unique codon usage and proceeds to provide a quantitative analysis of the codon usage in these classes. This observation is made for the genomes of Homo sapiens, Mus musculus, and Drosophila melanogaster. It is even more remarkable that the differences between codon usages in different age groups exhibit similar and consistent behavior in various organisms. While we find that GC content and gene length are also associated with the evolutionary age of genes, they can provide only a partial explanation for the observed codon usage. CONCLUSION: While factors such as GC content, mutational bias, and selection shape the codon usage in a genome, the evolutionary history of an organism over hundreds of millions of years is an overlooked property that is strongly linked to GC content, protein length, and, even more significantly, to the codon usage of metazoan genomes.
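The comparison of codon usage between age classes described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy sequences, the age-class labels and the simple absolute-difference distance are all assumptions made for illustration.

```python
from collections import Counter


def codon_frequencies(coding_sequences):
    """Return the relative frequency of each codon across a set of coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        # Walk the sequence in frame, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Skip codons containing ambiguous bases such as N.
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(freq_a, freq_b):
    """Sum of absolute frequency differences over all codons seen in either class."""
    codons = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)


# Hypothetical toy data: age-class label -> coding sequences of genes in that class.
genes_by_age_class = {
    "ancient": ["ATGGCCGCCTAA", "ATGGCGGCCTGA"],
    "recent": ["ATGGCAGCATAA"],
}
usage = {age: codon_frequencies(seqs) for age, seqs in genes_by_age_class.items()}
print(usage_distance(usage["ancient"], usage["recent"]))
```

A real analysis would also need to control for GC content and gene length, which the abstract reports as partial confounders.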

    Functional inference by ProtoNet family tree: the uncharacterized proteome of Daphnia pulex

    BACKGROUND: Daphnia pulex (water flea) is the first crustacean with a fully sequenced genome. Crustaceans and insects diverged from a common ancestor. Daphnia is a model organism for studying the molecular makeup that underlies coping with environmental challenges. The complete proteome contains 30,550 putative proteins. However, about 10,000 of them have no known homologues. Currently, UniProtKB reports 95% of the Daphnia proteins as putative and uncharacterized. RESULTS: We have applied ProtoNet, an unsupervised hierarchical protein clustering method that covers about 10 million sequences, for automatic annotation of the Daphnia proteome. 98.7% (26,625) of the Daphnia full-length proteins were successfully mapped to 13,880 ProtoNet stable clusters, and only 1.3% remained unmapped. We compared the properties of the Daphnia protein families with those of the mouse and the fruitfly proteomes. Functional annotations were successfully assigned for 86% of the proteins. Most proteins (61%) were mapped to only 2953 clusters that contain Daphnia's duplicated genes. We focused on the functionality of maximally amplified paralogs. Cuticle structure components and a variety of ion channel protein families were associated with a maximal level of gene amplification. We focused on gene amplification as a leading strategy of Daphnia in coping with environmental toxicity. CONCLUSIONS: Automatic inference is achieved through mapping of sequences to the protein family tree of ProtoNet 6.0. Applying a careful inference protocol resulted in functional assignments for over 86% of the complete proteome. We conclude that the scaffold of ProtoNet can be used as an alignment-free protocol for large-scale annotation tasks of uncharacterized proteomes.

    ProTeus: identifying signatures in protein termini

    ProTeus (PROtein TErminUS) is a web-based tool for the identification of short linear signatures in protein termini. It is based on a position-based search method for revealing short signatures in the termini of all proteins. The initial step in ProTeus development was to collect all signature groups (SIGs) based on their relative positions at the termini. The initial set of SIGs went through a sequential process of inspection, in which SIGs that did not meet the assigned statistical thresholds were removed. The SIGs that were found significant represent protein sets with minimal or no overall sequence similarity besides the similarity found at the termini. These SIGs were archived and are presented at ProTeus. The SIGs are sorted by the strength of their correspondence to functional annotations from external databases such as GO. ProTeus provides rich search and visualization tools for evaluating the quality of different SIGs. A search option allows the identification of terminal signatures in new sequences. ProTeus (ver 1.2) is available at

    Viral Proteins Acquired from a Host Converge to Simplified Domain Architectures

    The infection cycle of viruses creates many opportunities for the exchange of genetic material with the host. Many viruses integrate their sequences into the genome of their host for replication. These processes may lead to the acquisition of host sequences by the virus. Such sequences are prone to accumulation of mutations and deletions. However, in rare instances, sequences acquired from a host become beneficial for the virus. We searched for unexpected sequence similarity between the 900,000 viral proteins and all proteins from cellular organisms. Here, we focus on viruses that infect metazoa. The high-conservation analysis yielded 187 instances of highly similar viral-host sequences. Only a small number of them represent viruses that hijacked host sequences. The low-conservation sequence analysis utilizes the Pfam family collection. About 5% of the 12,000 statistical models archived in Pfam are composed of viral-metazoan proteins. In about half of these Pfam families, we provide indirect support for the directionality from the host to the virus. The other families are either wrongly annotated or reflect an extensive sequence exchange between the viruses and their hosts. In about 75% of cross-taxa Pfam families, the viral proteins are significantly shorter than their metazoan counterparts. The tendency for viral proteins to be shorter than their related host proteins reflects the acquisition of only a fragment of the host gene, the elimination of an internal domain and the shortening of the linkers between domains. We conclude that, along viral evolution, the host-originated sequences accommodate simplified domain compositions. We postulate that the trimmed proteins act by interfering with fundamental functions of the host, including intracellular signaling, post-translational modification, protein-protein interaction networks and cellular trafficking. We compiled a collection of hijacked protein sequences. These sequences are attractive targets for manipulation of viral infection.

    Cooperativity within proximal phosphorylation sites is revealed from large-scale proteomics data

    BACKGROUND: Phosphorylation is the most prevalent post-translational modification on eukaryotic proteins. Multisite phosphorylation enables a specific combination of phosphosites to determine the speed, specificity and duration of a biological response. Until recent years, the lack of high-quality data limited the possibility of analyzing the properties of phosphorylation at the proteome scale and in the context of a wide range of conditions. Thanks to advances in mass spectrometry technologies, thousands of phosphosites from in-vivo experiments were identified and archived in the public domain. Such a resource is appropriate for deriving an unbiased view of phosphosite properties in eukaryotes and of their functional relevance. RESULTS: We present statistically rigorous tests on the spatial and functional properties of a collection of ~70,000 reported phosphosites. We show that phosphosites tend to occur along the protein as dense clusters of serines/threonines (pS/pT), and as clusters that mix serines/threonines with tyrosines, but generally much less so as clusters of tyrosines (pY) only. This phenomenon is more ubiquitous than anticipated and is pertinent for most eukaryotic proteins: for proteins with ≥ 2 phosphosites, 54% of all pS/pT sites are within 4 amino acids of another site. We found a strong tendency for clustered pS/pT to be activated by the same kinase. Large-scale analyses of phosphopeptides are thus consistent with a cooperative function within the cluster. CONCLUSIONS: We present evidence supporting the notion that clusters of pS/pT, but generally not pY, should be considered the elementary building blocks in phosphorylation regulation. Indeed, closely positioned sites tend to be activated by the same kinase, a signal that overrides the tendency of a protein to be activated by a single kinase or only a few kinases. Within these clusters, coordination and positional dependency are evident. We postulate that cellular regulation takes advantage of such design. Specifically, phosphosite clusters may increase the robustness and effectiveness of the phosphorylation-dependent response. REVIEWERS: Reviewed by Joel Bader, Frank Eisenhaber, Emmanuel Levy (nominated by Sarah Teichmann). For the full reviews, please go to the Reviewers' comments section.
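The proximity statistic quoted in this abstract (the share of sites lying within 4 amino acids of another site, among proteins with at least 2 phosphosites) can be sketched as follows. The function name, the input layout and the toy positions are hypothetical, and treating "within 4 amino acids" as a position difference of at most 4 is an assumption. The sketch also does not distinguish pS/pT from pY sites, which the study does.

```python
def fraction_clustered(sites_by_protein, max_gap=4, min_sites=2):
    """Fraction of sites lying within max_gap residues of another site on the same protein.

    sites_by_protein maps a protein identifier to a list of 1-based site positions.
    Proteins with fewer than min_sites distinct sites are ignored.
    """
    clustered = 0
    total = 0
    for positions in sites_by_protein.values():
        pos = sorted(set(positions))
        if len(pos) < min_sites:
            continue
        for i, p in enumerate(pos):
            # Only the nearest neighbours on each side need to be checked.
            near_prev = i > 0 and p - pos[i - 1] <= max_gap
            near_next = i + 1 < len(pos) and pos[i + 1] - p <= max_gap
            if near_prev or near_next:
                clustered += 1
            total += 1
    return clustered / total if total else 0.0


# Hypothetical toy input: protein id -> phosphosite positions.
toy_sites = {"P1": [10, 12, 40], "P2": [5], "P3": [100, 103, 200, 250]}
print(fraction_clustered(toy_sites))
```

In the toy input, four of the seven counted sites have a neighbour within the gap, and the single-site protein is excluded from the denominator.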

    Synaptic proteins as multi-sensor devices of neurotransmission

    Neuronal communication is tightly regulated in time and space. Following neuronal activation, an electrical signal triggers neurotransmitter (NT) release at the active zone. The process starts when the signal reaches the synapse, followed by fusion of the synaptic vesicle (SV) and diffusion of the released NT in the synaptic cleft. The NT then binds to the appropriate receptor and induces a membrane potential change at the target cell membrane. The entire process is controlled by a fairly small set of synaptic proteins, collectively called SYCONs. The biochemical features of SYCONs underlie the properties of NT release. SYCONs are characterized by their ability to detect and respond to changes in environmental signals. For example, consider synaptotagmin I (Syt1), a prototype of a protein family with over 20 genes and variants in mammals. Syt1 is a specific example of a multi-sensor device with a large repertoire of discrete states. Several of these states are stimulated by a local concentration of signaling molecules such as Ca2+. The ability of this protein to sense signaling molecules and to adopt multiple biochemical states is shared by other SYCONs such as the synapsins (Syns). Specific biochemical states of Syns determine the accessibility of SV for NT release. Each of these states is defined by a specific alternatively spliced variant with a unique profile of phosphorylation-modified sites. The plasticity of the synapse is a direct reflection of the SYCONs' multiple biochemical states. State transitions occur over a wide range of time scales, and therefore these molecules need to cope with events that last milliseconds (e.g., exocytosis in fast responding synapses) and with events that can carry on for many minutes (e.g., organization of SV pools). We suggest that SYCONs are optimized throughout evolution as multi-sensor devices. A full repertoire of the switches leading to alternation of protein states and a detailed characterization of the protein-protein network within the synapse are critical for the development of a dynamic model of synaptic transmission.

    Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty?

    BACKGROUND: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. RESULTS: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover segments of 150 amino acids that are composed of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. CONCLUSIONS: While most TR-proteins have originated from prediction tools and are still awaiting experimental validation, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for creating novel genes with previously overlooked structural and functional characteristics.
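To make the notion of a tandem repeat built from basic units of 5-20 amino acids concrete, here is a deliberately simplified sketch. It detects only exact, back-to-back copies of a unit, whereas repeats in real proteomes are usually degenerate and are found with dedicated tools. The function name, parameters and toy sequence are assumptions and do not reflect the method used in the study.

```python
def find_tandem_repeats(protein, min_unit=5, max_unit=20, min_copies=2):
    """Return (start, unit, copies) tuples for exact, consecutive repeats of a unit.

    At each start position the unit length covering the longest repeated
    segment is reported, and scanning resumes after that segment.
    """
    hits = []
    i = 0
    n = len(protein)
    while i < n:
        best = None
        for unit_len in range(min_unit, max_unit + 1):
            unit = protein[i:i + unit_len]
            if len(unit) < unit_len:
                break  # not enough sequence left for this or any longer unit
            copies = 1
            while protein[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            covered = copies * unit_len
            if copies >= min_copies and (best is None or covered > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best:
            hits.append(best)
            i += best[2] * len(best[1])  # skip past the reported repeat segment
        else:
            i += 1
    return hits


# Hypothetical toy sequence: three exact copies of a 6-residue unit.
toy_protein = "MK" + "CHGPSA" * 3 + "LLV"
print(find_tandem_repeats(toy_protein))
```

The start index in each tuple is 0-based. Tolerating substitutions or indels between copies would need an alignment-based approach instead of string equality.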

    A functional hierarchical organization of the protein sequence space

    BACKGROUND: It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity. RESULTS: In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust. CONCLUSIONS: We present an automatically generated, unbiased method that provides a hierarchical classification of all currently known proteins.
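The idea of building a hierarchical tree from pairwise sequence similarity alone can be illustrated with a generic average-linkage agglomerative clustering. This is a textbook technique used here as a stand-in, not ProtoNet's actual merging rule, and the function name, the distance values and the requirement that every pair has a score are assumptions made for the toy example.

```python
from itertools import combinations


def average_linkage_tree(names, distance):
    """Agglomerative clustering that returns the merge history of the tree.

    distance maps frozenset({a, b}) to a dissimilarity for every pair of names.
    Each merge is recorded as (cluster_a, cluster_b, average_distance).
    """
    clusters = [frozenset([name]) for name in names]
    merges = []

    def avg(a, b):
        # Mean dissimilarity over all cross-cluster pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(distance[frozenset((x, y))] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((a, b, avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical dissimilarities between four toy proteins (lower = more similar).
toy_distance = {
    frozenset("AB"): 1.0, frozenset("CD"): 2.0,
    frozenset("AC"): 8.0, frozenset("AD"): 9.0,
    frozenset("BC"): 7.0, frozenset("BD"): 10.0,
}
for left, right, d in average_linkage_tree("ABCD", toy_distance):
    print(sorted(left), sorted(right), d)
```

Each recorded merge corresponds to an internal node of the tree, so early merges group close relatives and later merges group more distant families. This naive version recomputes averages at every step and would not scale to a million sequences.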